import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import acf as acf_func
plt.style.use("ggplot")
import yfinance as yf
import statsmodels.tsa.stattools as stattools
import statsmodels.api as sm
Cointegration means that two series are each non-stationary, yet some linear combination of them (often simply their difference) is stationary. A more formal definition follows.
If \(\{x_t\}\) and \(\{y_t\}\) are two non-stationary time series and there exist constants \(a\) and \(b\), not both zero, such that the linear combination \(ax_t+by_t\) is stationary, then we say \(\{x_t\}\) and \(\{y_t\}\) are cointegrated.
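To make the definition concrete, here is a minimal sketch, assuming NumPy only. The two series share a hypothetical common random-walk trend, so each one wanders, but the combination with \(a=-2\), \(b=1\) cancels the trend and leaves only bounded noise. The coefficients, noise scales, and variable names are illustrative assumptions, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Common stochastic trend: a random walk, which is non-stationary.
trend = np.cumsum(rng.normal(size=n))

# Stationary noise specific to each series (illustrative scales).
noise_x = rng.normal(scale=0.5, size=n)
noise_y = rng.normal(scale=0.5, size=n)

x = trend + noise_x
y = 2.0 * trend + noise_y

# The linear combination a*x + b*y with (a, b) = (-2, 1) cancels the trend.
a, b = -2.0, 1.0
spread = a * x + b * y

print("std of x:     ", x.std())
print("std of y:     ", y.std())
print("std of spread:", spread.std())
```

By construction the spread equals `noise_y - 2 * noise_x`, so it contains no trace of the random walk, while `x` and `y` individually do.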
In the example below, both generated series are non-stationary (they share a linear trend), with the same slope but different constant terms.
A common test is the augmented Engle-Granger test, with hypotheses \[ H_0: \text{No cointegration}\\ H_1: \text{Cointegration is present} \]
n_samples = 500
x = np.arange(n_samples)
# Two trending series: same slope, different intercepts, independent noise
y1 = 500 + 1.5 * x + np.random.normal(0, 50, size=n_samples)
y2 = 5 + 1.5 * x + np.random.normal(0, 50, size=n_samples)
# Autocorrelation of the difference, which should die out quickly if it is stationary
y1_m_y2_acf = acf_func(y1 - y2, nlags=50)
fig, ax = plt.subplots(figsize=(12, 8), nrows=3, ncols=1)
ax[0].plot(y1)
ax[0].plot(y2)
ax[1].plot(y1 - y2, color="ForestGreen")
ax[2].bar(np.arange(len(y1_m_y2_acf)), y1_m_y2_acf)
plt.show()
# coint returns (test statistic, p-value, critical values)
result = stattools.coint(y1, y2)
print("p-value of Augmented Engle-Granger test:{}".format(result[1]))
p-value of Augmented Engle-Granger test:1.0275717709224409e-25
The difference between the two series, i.e. the weights \([1, -1]\), is the most common linear combination used in cointegration analysis. As expected, the test shows overwhelming evidence of cointegration, because the series were constructed to be cointegrated.
Why is cointegration useful for trading? Because it is one of the mathematical foundations of mean-reverting strategies: we can construct a ‘synthetic’ stationary series from a combination of an arbitrary number of instruments.
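As a rough illustration of how a stationary spread could feed a mean-reverting rule, the sketch below standardises a spread and flags observations that sit far from its mean. The function name `zscore_signals` and the entry threshold of 1.5 are hypothetical choices made for this example. It also uses the full-sample mean and standard deviation, which would introduce look-ahead bias in a real backtest.

```python
import numpy as np


def zscore_signals(spread, entry=1.5):
    """Return -1 (short the spread), +1 (long the spread) or 0 per observation.

    Hypothetical helper: z-scores use the full-sample mean and standard
    deviation, so this is for illustration only, not for backtesting.
    """
    spread = np.asarray(spread, dtype=float)
    z = (spread - spread.mean()) / spread.std()
    signals = np.zeros(len(spread), dtype=int)
    signals[z > entry] = -1   # spread unusually high: bet on it falling
    signals[z < -entry] = 1   # spread unusually low: bet on it rising
    return signals


# Toy spread whose last point is far above its mean.
example = zscore_signals([0, 0, 0, 0, 10])
print(example)
```

A rolling mean and standard deviation would be the more realistic choice, and a constant spread (zero standard deviation) would need special handling.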
df = yf.download(
    ["USO", "XOM"],
    start="2019-01-01",
    end="2020-01-01",
    progress=True,
    actions="inline",
    interval="1d",
)
[*********************100%***********************]  2 of 2 completed
df = df["Adj Close"]
df.columns = ["USO", "XOM"]
Linear regression of USO (\(y_t\)) on XOM (\(x_t\)) \[ y_t =\beta_1 + \beta_2 x_t + \varepsilon_t \]
X = df["XOM"]
X = sm.add_constant(df["XOM"])  # adding a constant
Y = df["USO"]
results = sm.OLS(Y, X).fit()
The residuals of the fitted regression are \[ \hat{\varepsilon}_t = y_t - \hat{\beta}_1 - \hat{\beta}_2 x_t \]
The cointegrated augmented Dickey-Fuller (CADF) test is useful here. It estimates the hedge ratio by regressing one series on the other, then applies the ADF test to the regression residuals.
results.params
const 23.198037
XOM 1.263767
dtype: float64
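The fitted slope on XOM (about 1.26 here) plays the role of the hedge ratio, and the regression residual is the spread that the ADF test is then applied to. The sketch below reproduces only that first step with NumPy, on made-up toy data. The helper name `hedge_ratio_and_spread` is hypothetical, and `np.polyfit` stands in for the `sm.OLS` call used in this document.

```python
import numpy as np


def hedge_ratio_and_spread(y, x):
    """Regress y on x with an intercept; return (slope, intercept, residual spread)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # polyfit returns coefficients from the highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(x, y, deg=1)
    spread = y - intercept - slope * x
    return slope, intercept, spread


# Toy data (an assumption for illustration): y is roughly 3 + 2 * x.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 3.0 + 2.0 * x_demo + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

slope, intercept, spread = hedge_ratio_and_spread(y_demo, x_demo)
print("hedge ratio:", slope)
print("spread:", spread)
```

Because the fit includes an intercept, the residual spread sums to zero and is uncorrelated with `x_demo`; whether it is stationary is the separate question the ADF step answers.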
cadf = stattools.adfuller(results.resid)
print("ADF Statistic: %f" % cadf[0])
print("p-value: %f" % cadf[1])
ADF Statistic: -2.891340
p-value: 0.046364
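The null hypothesis of the augmented Dickey-Fuller test is that the series has a unit root; the alternative is that it does not. Here the p-value of about 0.046 rejects the null at the 5% level, though only marginally, which suggests the residuals are stationary and the two series were cointegrated over this sample.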